Release notes

Curated changelog for the scrapingpros Python SDK. Each release lists what changed from the user's perspective — bugs you'd actually hit, features you can use, things that may need attention.

📦 Install: pip install --upgrade scrapingpros
🐍 PyPI: pypi.org/project/scrapingpros
📚 Docs: docs.scrapingpros.com/docs/category/python-sdk
🔧 API status: api.scrapingpros.com
🆓 Demo token: demo_6x595maoA6GdOdVb (5,000 credits/month, no signup)

0.7.8 — 2026-05-31

Robustness patch on the v0.7.7 inline-result path. A single malformed inline body could surface as a ValidationError against the whole jobs listing page — losing every job on that page, not just the bad row. v0.7.8 isolates the failure: the offending row's result becomes None and the SDK's existing per-job GET /jobs/{id}/result fallback handles it, exactly as if the server had returned result=None to begin with.

Why

Production reports of pydantic.ValidationError raised from iter_run_jobs(include_results=True) (and transitively from Batch.iter_results()) during active draining. Jobs in transient states can arrive with result shapes that don't strictly match ScrapeResponse — a missing required sub-field, a sub-field with the wrong type, an unexpected payload. Before this patch, that single row aborted the whole page parse.

Changed

  • JobExecutionPublic.result parses tolerantly. A new @field_validator(mode="before") catches any exception raised while deserializing the nested ScrapeResponse and returns None. The well-formed rows on the same page parse normally. The malformed row's body is fetched via GET /jobs/{id}/result on demand by _build_result, same as the legacy / oversize / blob-miss fallback paths.

    Pre-existing behaviour for well-formed inline bodies is unchanged: those still parse to ScrapeResponse and skip the per-job fetch.
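The per-row isolation described above can be sketched with standard-library stand-ins. This is only an illustration of the idea, not the SDK's code: `SketchResult`, `parse_result`, `parse_page`, and the dict-shaped rows are hypothetical, and the real SDK does this inside a pydantic validator on JobExecutionPublic.result.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class SketchResult:
    """Hypothetical stand-in for ScrapeResponse with two required fields."""
    url: str
    html: str


def parse_result(raw: Any) -> Optional[SketchResult]:
    """Return a parsed result, or None if the inline body is malformed."""
    if raw is None:
        return None
    try:
        return SketchResult(url=raw["url"], html=raw["html"])
    except Exception:
        # Malformed body: degrade to None so the caller can fall back to a
        # per-job fetch instead of losing every row on the page.
        return None


def parse_page(rows: List[dict]) -> List[Tuple[str, Optional[SketchResult]]]:
    """Parse every row; a bad `result` only affects its own row."""
    # Note: a missing job-level field such as "job_id" still raises here.
    return [(row["job_id"], parse_result(row.get("result"))) for row in rows]


page = [
    {"job_id": "a", "result": {"url": "https://example.com", "html": "<p>ok</p>"}},
    {"job_id": "b", "result": {"url": "https://example.com"}},  # missing "html"
    {"job_id": "c", "result": None},
]
parsed = parse_page(page)
# Rows whose result is None are the ones that would need the per-job fetch.
needs_fetch = [job_id for job_id, result in parsed if result is None]
```

Only the `result` field is parsed tolerantly in this sketch; anything wrong at the row level still fails loudly, which mirrors the note that other listing fields keep raising.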

Notes

  • Strict superset of v0.7.7. No API changes.
  • If you reverted to include_results=False after hitting the v0.7.7 issue, you can re-enable it on v0.7.8.
  • The same tolerance does not apply to other listing fields. Anything else that fails to validate (e.g. an unexpected job-level metadata shape) still raises — those would be real schema drift the caller should see.

0.7.7 — 2026-05-31

The polling-efficiency story closes. Two changes that compound: the SDK now consumes the inline result body when the listing brings it, and tightens the counter-based polling so the brief window between "run reached terminal state" and "listing finished seeding" can't drop the last few jobs. Combined with the v0.7.3 (adaptive poll_interval) and v0.7.4 (counter short-circuit) work already on PyPI, a high-volume iter_results() consumer should see its polling traffic drop by roughly an order of magnitude — without any caller-side migration.

Added

  • include_results= opt-in on the jobs listing endpoint. client.iter_run_jobs(...) and client.get_run_jobs(...) (sync + async) accept a new include_results: bool = False kwarg. When True, each completed job in the page carries its full body in the new JobExecutionPublic.result field, eliminating the GET /jobs/{id}/result per-job round-trip that previously dominated polling traffic. The server caps the page size to 100 when this flag is on (vs 1000 for metadata-only); the SDK mirrors that cap. Bodies above the inline cap (~256 KB), non-completed jobs, and listings from older servers all surface as result=None; the SDK falls back per-job to the existing /result endpoint for those.

    for job in client.iter_run_jobs(cid, rid,
                                    status_filter="completed",
                                    include_results=True):
        # Use the inline body when present; otherwise fetch it per job.
        result = job.result or client.get_job_result(
            job.collection_id, job.run_public_id, job.job_public_id
        )
        process(result, job.custom_id)
  • JobExecutionPublic.result: ScrapeResponse | None — the inline body that include_results=True populates. None on the metadata-only path and for the fallback cases above.

  • RunPublic.all_jobs_persisted: bool | None — exposed in the model for callers that want to read it directly (it signals when the run's listing is fully drainable). The SDK's polling does not key off this field; instead, the iteration completeness guard described below uses the run counters + the set of jobs the SDK has already yielded.

Changed

  • Batch.iter_results (and AsyncBatch.iter_results) automatically use the inline-body path. Internally the polling tick now sends include_results=true to the listing endpoint and consumes job.result directly when present, falling back to the per-job /result only for the ~5% of jobs the server marks result=None. Calling code is unchanged: no migration, no opt-in, no new kwargs. Existing iter_results() loops just see fewer outbound HTTP calls.

  • Iteration completeness guard. Batch.iter_results (and async) only exit the polling loop once len(seen_job_ids) >= expected_terminal_count, even if the run's status flipped to completed earlier. There is a brief window where the run reports status=completed while the listing is still seeding the final few jobs; the SDK previously could break out of the loop and drop them. Now it keeps polling until the listing catches up. The expected_terminal_count adapts to the include_failed flag (uses success_requests only when failures are excluded), so callers asking only for successful jobs still exit cleanly.

  • Job → result metadata propagation. result.custom_id and result.url are backfilled from the underlying JobExecutionPublic if the server didn't echo them on the body. Same behaviour regardless of whether the body came inline or via the /result fallback. Keeps the existing traceability contract intact.

  • status_filter on the low-level iter_run_jobs / get_run_jobs (sync + async) accepts a list or CSV string. Drains multiple terminal states in one paginated stream instead of one call per status. The high-level Batch.iter_results already drained in one stream internally; this is a public-surface fix for the low-level path:

    # Before: 3 round-trips
    for status in ("completed", "failed", "timeout"):
        for job in client.iter_run_jobs(cid, rid, status_filter=status):
            process(job)

    # After (v0.7.7+): 1 paginated stream
    for job in client.iter_run_jobs(
        cid, rid,
        status_filter=["completed", "failed", "timeout"],
    ):
        process(job)
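The iteration completeness guard from the list above reduces to an exit condition along these lines. This is a minimal sketch, not the SDK's code: the function names are hypothetical, and summing success + failed + timeout when include_failed is True is an assumption based on the terminal counters named in the v0.7.4 notes.

```python
from typing import Set


def expected_terminal_count(success: int, failed: int, timeout: int,
                            include_failed: bool) -> int:
    """How many jobs the loop should have yielded before it may exit."""
    if include_failed:
        # Assumption: every terminal counter contributes when failures are wanted.
        return success + failed + timeout
    # Failures excluded: only successful jobs are ever yielded.
    return success


def may_exit(run_is_terminal: bool, seen_job_ids: Set[str], expected: int) -> bool:
    """Exit only when the run is terminal AND the listing has caught up."""
    return run_is_terminal and len(seen_job_ids) >= expected


# The run already reports a terminal status, but the listing has only
# surfaced 2 of the 3 successful jobs so far: keep polling.
expected = expected_terminal_count(success=3, failed=1, timeout=0, include_failed=False)
keep_polling = not may_exit(True, {"job-1", "job-2"}, expected)
```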

Impact

For a polling pattern that hits the high-level Batch.iter_results(), the per-tick wire cost evolves like this on a typical run with hundreds of completed jobs:

| Tick component | v0.7.4 | v0.7.7 |
|---|---|---|
| GET /runs/{rid} (status) | 1 | 1 |
| GET /jobs?... (listing) | 0–1 (counter short-circuit) | 0–1 (counter + persisted short-circuits) |
| GET /jobs/{id}/result per completed job | N | ~0 (inline) |

The per-job /result calls were the dominant cost on completed-heavy runs. With v0.7.7 they disappear from the wire on the happy path, and the listing-only short-circuits keep the idle-tick cost at a single status call.

Notes / migration

  • Strict superset of v0.7.5 / v0.7.6 (this release supersedes the unreleased v0.7.6 candidate; both sets of changes ship together). Every existing iter_results() / iter_run_jobs() / get_run_jobs() call keeps working unchanged.

  • Backward compat with older servers: JobExecutionPublic.result defaults to None and RunPublic.all_jobs_persisted is bool | None. Servers that don't return these fields still parse cleanly, and the SDK behaviour falls back to the previous polling pattern automatically (per-job /result, no persistence guard).

  • What you don't have to do: nothing. The inline-body consumption is internal to Batch.iter_results — no kwarg to set, no migration. If you're on the low-level iter_run_jobs path and want the same win, pass include_results=True explicitly.


0.7.5 — 2026-05-28

Cross-process resume — pass a cursor when reattaching to a batch and the SDK starts strictly after that point instead of re-yielding every job. Closes the last remaining gap that pushed restart-resilient pipelines onto the low-level path, so users who were rolling their own polling loops to get cursor support can now stay on iter_results and inherit the v0.7.3 / v0.7.4 polling optimisations for free.

Added

  • Batch.iter_results(since=...) and AsyncBatch.iter_results(since=...) — cross-process resume cursor. Accepts a datetime or an ISO 8601 string; jobs with completed_at <= since are skipped. The SDK uses the server-side since_completed_at filter so the resume pages only new jobs from the wire (no client-side dedup needed).

  • Batch.last_completed_at / AsyncBatch.last_completed_at — read-only property exposing the high-water mark the SDK tracks during iteration. Read it after each yielded result and persist alongside (collection_id, run_id) for cross-process resume:

    for result in batch.iter_results():
        save(result)
        # Persist the high-water mark after each yielded result.
        db.update(cid=batch.collection_id, rid=batch.run_id,
                  cursor=batch.last_completed_at)

    # Different process:
    for result in client.iter_results(saved_cid, saved_rid, since=saved_cursor):
        save(result)
  • since= on client.iter_results(cid, rid, ...) and AsyncClient.iter_results(cid, rid, ...) — shortcut propagates the kwarg to the underlying Batch.

  • since_completed_at= on client.iter_run_jobs(...) and client.get_run_jobs(...) (sync + async) — same primitive exposed on the low-level path. For users who keep a custom polling loop, this closes the parity gap with what Batch.iter_results uses internally. Documented as the canonical "rolling your own polling loop" recipe in Collections (low-level).
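The cursor semantics above (skip jobs with completed_at <= since, track a high-water mark) can be illustrated with a small client-side sketch. The SDK itself filters on the server via since_completed_at; `to_datetime`, `resume`, and the (job_id, completed_at) tuples here are hypothetical stand-ins used only to show the behaviour.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional, Tuple, Union


def to_datetime(value: Union[datetime, str]) -> datetime:
    """Accept a datetime or an ISO 8601 string, like the `since=` kwarg."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


def resume(jobs: Iterable[Tuple[str, datetime]],
           since: Optional[Union[datetime, str]] = None,
           ) -> Iterator[Tuple[str, datetime]]:
    """Yield (job_id, high_water_mark) for jobs strictly after `since`."""
    cutoff = to_datetime(since) if since is not None else None
    high_water = cutoff
    for job_id, completed_at in jobs:
        if cutoff is not None and completed_at <= cutoff:
            continue  # already processed before the restart
        if high_water is None or completed_at > high_water:
            high_water = completed_at
        yield job_id, high_water


def at(minute: int) -> datetime:
    return datetime(2026, 5, 28, 10, minute, tzinfo=timezone.utc)


JOBS = [("a", at(0)), ("b", at(5)), ("c", at(10))]

# First process: handles "a" and "b", persists the cursor, then crashes.
first_run = list(resume(JOBS))
saved_cursor = first_run[1][1].isoformat()

# Second process: resumes strictly after the saved cursor.
second_run = [job_id for job_id, _ in resume(JOBS, since=saved_cursor)]
```

Persisting the cursor as an ISO 8601 string keeps it storable in any database column while still round-tripping to a timezone-aware datetime.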

Notes / migration

  • Strict superset of v0.7.4. Zero breakage: every existing iter_results() / iter_run_jobs() call without since keeps starting from the beginning, exactly as before.

  • Why this matters: pipelines that crash and restart, webhook handlers that reattach inside an HTTP request handler, dashboards that page through a long-running batch — all of these used to either re-process duplicates or write their own paginator around iter_run_jobs. Now the high-level path supports the pattern natively, and migrating to it brings the v0.7.3 (adaptive poll_interval) and v0.7.4 (counter short-circuit) optimisations along.

  • Concurrent multi-batch draining: the canonical pattern is still asyncio.gather over independent iter_results generators — each batch keeps its own adaptive cadence and short-circuit state, and the SDK doesn't add a coalesced drain_many because per-batch (cid, rid) are separate server endpoints with no bulk-status equivalent. See Batch API → Draining several batches for the 6-line recipe.
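The asyncio.gather pattern mentioned above can be sketched as follows. This is not the recipe from the Batch API page; `fake_iter_results` is a hypothetical stand-in for batch.iter_results() so the example runs without a server.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_iter_results(name: str, count: int, delay: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for batch.iter_results(): yields `count` results."""
    for i in range(count):
        await asyncio.sleep(delay)  # simulates waiting for the next result
        yield f"{name}-{i}"


async def drain(name: str, results: AsyncIterator[str], sink: List[str]) -> str:
    """Consume one generator to completion; each drain runs independently."""
    async for result in results:
        sink.append(result)
    return name


async def drain_all() -> Tuple[List[str], List[str]]:
    sink: List[str] = []
    # One drain task per batch; gather interleaves them on the same loop.
    finished = await asyncio.gather(
        drain("a", fake_iter_results("a", 3, 0.01), sink),
        drain("b", fake_iter_results("b", 2, 0.005), sink),
    )
    return list(finished), sink
```

Running `asyncio.run(drain_all())` returns the batch names in the order they were passed to gather, while the results in `sink` arrive interleaved.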


0.7.4 — 2026-05-28

Polling-side request reduction: when the run's aggregate terminal counters (success_requests + failed_requests + timeout_requests on RunPublic) haven't moved since the previous tick, iter_results now skips the jobs-page query entirely. The status response alone tells us nothing new happened — querying the jobs page is guaranteed to come back empty, so the round-trip is wasted.

Changed

  • Batch._collect_new_terminal_jobs / AsyncBatch._collect_new_terminal_jobs short-circuit on stable counters. Each idle polling tick now costs one request (run status) instead of two (run status + jobs page). The since_completed_at filter still guarantees we'd pick up any missed jobs on the next non-skipped tick, so this is a pure latency / request optimisation with no correctness impact.
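The short-circuit above boils down to remembering the previous sum of terminal counters. A minimal sketch, assuming the three counters named in this release; `TickPlanner` is a hypothetical name, not an SDK class.

```python
from typing import Optional


class TickPlanner:
    """Decides whether a polling tick needs the jobs-page query."""

    def __init__(self) -> None:
        self._last_terminal_total: Optional[int] = None

    def should_query_jobs(self, success: int, failed: int, timeout: int) -> bool:
        total = success + failed + timeout
        # Unchanged counters mean no job reached a terminal state since the
        # previous tick, so the jobs page would come back empty.
        changed = total != self._last_terminal_total
        self._last_terminal_total = total
        return changed


planner = TickPlanner()
ticks = [(0, 0, 0), (0, 0, 0), (3, 1, 0), (3, 1, 0), (5, 1, 0)]
decisions = [planner.should_query_jobs(*counters) for counters in ticks]
```

The first tick always queries in this sketch because there is no previous total to compare against.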

Impact

Combined with the v0.7.3 adaptive poll_interval:

| Scenario | v0.7.2 polling | v0.7.3 polling | v0.7.4 polling |
|---|---|---|---|
| 5,000-URL batch, 1 h run, jobs trickle in 20 bursts | ~1,440 req/h (5 s × 2) | ~240 req/h (30 s × 2) | ~140 req/h (30 s × 1 + 20 × 1) |
| 5 such batches in parallel | ~7,200 req/h | ~1,200 req/h | ~700 req/h |

The optimisation is invisible to callers — iter_results yields the same results in the same order; only the polling traffic drops.
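The figures in the table follow from simple arithmetic: ticks per hour times requests per tick, plus one extra jobs-page request per burst for v0.7.4. The helper below is a hypothetical back-of-the-envelope calculator, not part of the SDK.

```python
def polling_requests_per_hour(poll_interval_s: float,
                              requests_per_tick: int,
                              extra_requests: int = 0) -> int:
    """Estimate hourly polling traffic for one batch."""
    ticks_per_hour = 3600 / poll_interval_s
    return int(ticks_per_hour * requests_per_tick + extra_requests)


# One 5,000-URL batch over a 1 h run, jobs arriving in 20 bursts.
v072 = polling_requests_per_hour(5, 2)        # status + jobs page every 5 s
v073 = polling_requests_per_hour(30, 2)       # status + jobs page every 30 s
v074 = polling_requests_per_hour(30, 1, 20)   # status only, plus 20 jobs-page hits

# Five such batches in parallel.
parallel = [5 * n for n in (v072, v073, v074)]
```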

Notes

  • Pure additive: no API changes, no behaviour changes for callers that pass an explicit poll_interval.
  • Safe even if the server's counters lag by one tick: the next non-skipped tick re-queries from the same since_completed_at high-water mark and catches everything.

0.7.3 — 2026-05-23

Adaptive poll_interval default — the polling cadence for iter_results and run_and_wait is now sized to the batch instead of a fixed 5 s / 2 s. Small batches stay responsive; long batches stop burning the rate budget on status checks. A pipeline running five 5,000-URL batches in parallel — each iterating at the old 5 s default — used to spend ~3,000 requests per hour just polling. With the new default it spends ~500, leaving the rest of the rate budget for actual scraping.

Changed

  • Batch.iter_results(poll_interval=...) and AsyncBatch.iter_results(poll_interval=...) — default changes from a constant 5.0 to an adaptive value picked from the batch's queued count:

    | Items in queue | Default poll_interval |
    |---|---|
    | < 100 | 5 s (status quo) |
    | 100 – 499 | 10 s |
    | 500 – 1,999 | 15 s |
    | ≥ 2,000 | 30 s |

    Pass an explicit poll_interval=N to override. Same tier table applies to Batch.run_until_complete / AsyncBatch.run_until_complete and to the client.iter_results(cid, rid) shortcut.

  • ScrapingPros.run_and_wait(poll_interval=...) and AsyncScrapingPros.run_and_wait(poll_interval=...) — default changes from 2.0 to an adaptive value. run_and_wait ticks are cheaper than iter_results ticks (one status GET vs status + jobs page + N parallel result fetches), so the tier values are smaller:

    | Items in queue | Default poll_interval |
    |---|---|
    | < 100 | 2 s (status quo) |
    | 100 – 499 | 5 s |
    | 500 – 1,999 | 10 s |
    | ≥ 2,000 | 20 s |

    Sized off the just-created run's total_requests. The 3,600 s timeout default is unchanged.
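The two tier tables above can be expressed as a single lookup. The sketch below is a local re-implementation for illustration only (`sketch_poll_interval` is a hypothetical name); it assumes the tier boundaries exactly as listed and that "jobs" maps to iter_results and "status" to run_and_wait.

```python
import bisect

# Lower bounds of the second, third, and fourth tiers.
_TIER_BOUNDS = [100, 500, 2000]
_TIER_SECONDS = {
    "jobs": [5.0, 10.0, 15.0, 30.0],    # iter_results cadence
    "status": [2.0, 5.0, 10.0, 20.0],   # run_and_wait cadence
}


def sketch_poll_interval(items: int, kind: str = "jobs") -> float:
    """Return the default polling interval (seconds) for a batch size."""
    # bisect_right counts how many tier boundaries `items` has reached.
    tier = bisect.bisect_right(_TIER_BOUNDS, items)
    return _TIER_SECONDS[kind][tier]
```

For example, a 1,500-item batch lands in the third tier: 15 s for iter_results and 10 s for run_and_wait.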

Added

  • scrapingpros.adaptive_poll_interval(items, *, kind="jobs"|"status") — public helper exposing the tier lookup. Use it to predict the default cadence the SDK will pick for a given batch size, or in your own monitoring loops:

    from scrapingpros import adaptive_poll_interval

    cadence = adaptive_poll_interval(len(my_urls)) # for iter_results
    status_cadence = adaptive_poll_interval(len(my_urls), kind="status") # for run_and_wait
  • INFO-level log line emitted once per iter_results / run_and_wait when the adaptive default is selected:

    INFO  Batch <run_id>: iterating 1500 item(s) with poll_interval=15s (auto).
    Pass poll_interval= to override.

    Visible if you've set the scrapingpros logger to INFO; silent at the default WARNING level. Helps surface the cadence in production logs without forcing it on every caller.
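To surface that log line, raise the SDK logger to INFO with the standard logging module. This assumes the logger is registered under the name "scrapingpros", as the note above suggests; the format string is just an example.

```python
import logging

# Attach a handler to the root logger (no-op if one is already configured).
logging.basicConfig(format="%(levelname)s  %(message)s")

# Assumption: the SDK logs under the "scrapingpros" logger name.
sdk_logger = logging.getLogger("scrapingpros")
sdk_logger.setLevel(logging.INFO)
```

Setting the level on the named logger rather than the root keeps other libraries at their default verbosity.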

Notes / migration

  • No breaking changes. Any code that passed poll_interval= explicitly is unaffected. Code that relied on the implicit 5 s / 2 s now sees the adaptive value — same correctness, different cadence.

  • Same surface, smaller cost on multi-batch pipelines. If you orchestrate several long-running batches in parallel, you'll see the difference most. A single small batch behaves identically to v0.7.2.

  • When to override: pass poll_interval=N if you have a latency-sensitive UI (override to a lower value) or want even gentler polling than the default (override higher). The tiers are sized to be safe, not minimum.


0.7.2 — 2026-05-23

Surfaces the per-item validation buckets the server now returns when creating a collection. Before this release, an item rejected for a field-level reason (e.g. custom_id over 255 chars) used to fail the whole submit with HTTP 422; the API now buckets those rejections into invalid_items and creates the collection with whichever items passed. v0.7.1 silently discarded those buckets: a caller submitting 1,000 items with 3 long custom_ids would see a Batch claiming 1,000 items while the server enqueued 997. v0.7.2 surfaces all of it.

Added

  • Batch.invalid_items (list of InvalidItem) and AsyncBatch.invalid_items — items the server rejected for Pydantic body-validation or parameter-rule reasons (custom_id too long, screenshot without browser, etc.). Each InvalidItem carries its 0-based index, the url if it could be read, and a list of InvalidItemError (field, error_type, message).

    batch = client.submit_batch("daily", items)
    for it in batch.invalid_items:
        print(f" - [{it.index}] {it.url}")
        for err in it.errors:
            print(f"     {err.field}: {err.error_type} - {err.message}")
  • Batch.duplicate_urls / AsyncBatch.duplicate_urls — the explicit list of URLs the server skipped as duplicates of an earlier item in the same submit (one entry per skipped occurrence). The legacy duplicates_skipped count is preserved as Batch.duplicates_skipped.

  • Batch.blocked_urls / AsyncBatch.blocked_urls — same BlockedURL shape that submit_batch_lenient returned in v0.5.3, now accessible directly on every Batch so non-lenient callers can inspect them too.

  • InvalidItem and InvalidItemError models exported from the top-level package.

  • NewCollectionResponse.invalid_items and NewCollectionResponse.duplicate_urls — already populated by the server; the SDK model was discarding them before this release.

  • Batch.summary() / AsyncBatch.summary() returning a frozen BatchSummary dataclass — the single call to get a complete end-of-run report. Counts both the items the server ran (queued / succeeded / failed) and the items it rejected at submit time (blocked / invalid / duplicates), so a caller no longer has to assemble that picture from four different attributes. str(summary) produces a multi-line ASCII block ready for logs:

    Batch summary (status: completed)
    submitted : 1752
    queued : 1749
    succeeded: 1701
    failed : 48
    rejected : 3
    blocked : 0
    invalid : 3
    duplicates : 0

    Canonical usage inside an on_complete callback so the report fires automatically when the run terminates:

    def report(b):
        print(b.summary())

    batch = client.submit_batch("daily", items).on_complete(report)
    for r in batch.iter_results():
        handle(r)
    # At loop exit, `report` has fired with the full picture.

    Invariants BatchSummary exposes (and the SDK enforces in tests): submitted == queued + blocked + invalid + duplicates after submit, and succeeded + failed == queued once is_finished is True.

  • Batch.submitted_count / AsyncBatch.submitted_count — the original payload length (what the caller handed to submit_batch). None on handles built via client.get_batch() (reattached after a process restart): the server does not echo back the original submit size on GET /v1/async/collections/{id} yet, so the SDK cannot recover it without persisting len(payload) alongside the IDs in your own storage.

  • get_batch(submitted_count=...) and client.iter_results(cid, rid, submitted_count=...) — pass-through hint for the reattach case. If you persisted len(payload) alongside (cid, rid), hand it back when reattaching and batch.summary() reports the full picture instead of submitted=None. Same kwarg on AsyncClient.get_batch / AsyncClient.iter_results.

    # On submit: persist alongside the IDs.
    db.batches.insert(cid=batch.collection_id, rid=batch.run_id,
                      submitted=len(items))

    # On reattach (webhook, restart, dashboard):
    row = db.batches.find(cid=cid)
    for r in client.iter_results(cid, rid, submitted_count=row.submitted):
        handle(r)
    # The on_complete summary now shows the full picture.
  • AsyncClient.submit_batch_lenient(name, items) — async counterpart of the sync method. Same contract: returns (batch, blocked), does not emit the partial-success RuntimeWarning, and exposes batch.invalid_items / batch.duplicate_urls for the other two rejection buckets. Closes the asymmetry where the production-first async client lacked the opt-in handler for partial-success.

    async with AsyncClient(token) as client:
        batch, blocked = await client.submit_batch_lenient("daily", items)
        for b in blocked:
            log.warning("blocked %s (%s)", b.url, b.reason)
        async for r in batch.iter_results():
            handle(r)
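The BatchSummary invariants listed earlier in this section can be checked with a few lines. `SketchSummary` is a hypothetical stand-in with the same field names as the printed report, not the SDK's dataclass; the numbers come from the sample summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSummary:
    """Hypothetical stand-in mirroring the fields of the printed report."""
    submitted: int
    queued: int
    succeeded: int
    failed: int
    blocked: int
    invalid: int
    duplicates: int

    @property
    def rejected(self) -> int:
        return self.blocked + self.invalid + self.duplicates

    def holds(self, finished: bool) -> bool:
        """Check both invariants; the second applies only to finished runs."""
        submit_ok = self.submitted == self.queued + self.rejected
        run_ok = (not finished) or (self.succeeded + self.failed == self.queued)
        return submit_ok and run_ok


sample = SketchSummary(submitted=1752, queued=1749, succeeded=1701,
                       failed=48, blocked=0, invalid=3, duplicates=0)
```

A check like this is handy in a pipeline's own tests: if the numbers stop adding up, items are being lost somewhere between submit and result handling.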

Changed

  • submit_batch (strict) emits a RuntimeWarning when the server bucketed any items into blocked_urls, invalid_items, or duplicate_urls. The warning summarises the counts plus the first invalid item's field / error_type so the actionable detail is in the log without forcing the caller to inspect the batch. Visible by default on CPython; suppress with warnings.filterwarnings("ignore", category=RuntimeWarning, module="scrapingpros.batch"). The warning is opt-out by design: silent data loss is the failure mode this release closes.

    Same warning is emitted by AsyncClient.submit_batch.

  • Batch.total (and AsyncBatch.total) now seeds correctly at len(payload) − len(blocked) − len(invalid) − duplicates_skipped instead of len(payload). This affects the value of batch.pct, batch.processing_count, and batch.eta_seconds before the first server poll; once polling starts, the server-reported total_requests takes over (unchanged behaviour). Old code that read batch.total immediately after submit would have seen the wrong value when items were rejected — now it sees the queued count.

  • submit_batch_lenient does not emit the warning (lenient mode is the explicit opt-in for handling rejections). Its return signature is unchanged — (batch, blocked) — to avoid breaking existing callers. The other two buckets are accessible as batch.invalid_items and batch.duplicate_urls on the returned handle.
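In a test suite it can be useful to go the other way and escalate the partial-success warning to an error, using only the standard warnings module. The sketch below uses a hypothetical `fake_submit` stand-in that warns the way the strict submit is described to; it does not call the SDK.

```python
import warnings
from typing import List


def fake_submit(items: List[str]) -> int:
    """Hypothetical stand-in: warns when some items would be rejected."""
    invalid = [item for item in items if len(item) > 255]
    if invalid:
        warnings.warn(
            f"{len(invalid)} invalid item(s) were not queued",
            RuntimeWarning,
            stacklevel=2,
        )
    return len(items) - len(invalid)  # number of items actually queued


def submit_strictly(items: List[str]) -> int:
    """Turn the RuntimeWarning into an exception for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=RuntimeWarning)
        return fake_submit(items)
```

The filter change is scoped to the `with` block, so the rest of the process keeps its normal warning behaviour.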

Notes / migration

  • A 100%-valid batch behaves identically to v0.7.1: all three buckets are empty lists, no warning is emitted, batch.total == len(payload). The change is invisible to callers that don't hit the partial-success path.

  • Server-side context: the API extended its partial-success contract from SSRF rejections (blocked_urls, since v0.5.3) to Pydantic body-validation and parameter rules. The "fail the whole batch on first invalid item" behaviour is gone.

  • Why a warning instead of an exception: raising would break clients whose inputs occasionally drift (a CSV with a stray long string in custom_id). The warning lets the caller see the problem during testing, decide whether to handle it via submit_batch_lenient or input cleanup, without forcing every caller to wrap submits in a try/except.


0.7.1 — 2026-05-18

Catches a pool-exhaustion failure mode that survives v0.7.0 when one AsyncClient is shared across a long prep phase and a parallel submit phase. v0.7.0's PoolExhausted correctly names client-pool saturation (vs. a server timeout), but it doesn't fix the underlying cause when the same client is reused across phases.

The failure shape

The script uses one shared AsyncClient for two consecutive phases:

  1. A "prep" phase: thousands of parallel GETs to fetch per-domain configs.
  2. A "submit" phase: dozens of parallel submit_batch calls right after.

By the time the submit phase starts, the shared httpx pool is full of keepalive / draining connections from the prep phase. Submit requests queue waiting for a free slot, eventually time out — even though the server is healthy.

Reproduction: any script that does ~20+ parallel GETs followed by ~10 parallel submits on one AsyncClient hits this. The fix is per-worker pool isolation.

Added

  • AsyncClient.submit_batches_concurrent(batches, *, concurrency=10) — submit many batches in parallel with per-worker pool isolation. Each worker spawns its own fresh AsyncClient inside an async with, so the submit pool is never starved by stale connections from a prior prep phase on the parent client. Returns a list[AsyncBatch] in input order; each handle is reattached to the parent client so iter_results works after the worker clients close.

    async with AsyncClient(token) as client:
        configs = await client.batch_scrape(catalog_urls)  # prep
        batches = await client.submit_batches_concurrent(  # submit
            [(f"daily-{i}", chunk) for i, chunk in enumerate(chunks)],
            concurrency=15,
        )
        for batch in batches:
            async for r in batch.iter_results():
                save(r)

    Measured on a reproduction with 25 batches × ~500 items each (12,408 items total): submitted in 20 seconds with zero errors on the same script that previously failed with a shared client.

  • SyncClient.submit_batches_concurrent — same surface, backed by a ThreadPoolExecutor with one fresh SyncClient per thread. For most production code prefer the async variant; this exists so sync users don't get left behind.

  • BatchSpec input shape. submit_batches_concurrent accepts either (name, items) tuples or {"name": ..., "items": [...], "callback_url": ...} dicts. items is the same shape as submit_batch (list of URL strings, dicts, or ScrapeRequest).
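The per-worker isolation idea can be sketched with a semaphore and one fresh client per submit. This is an illustration under stated assumptions, not the SDK's implementation: `FakeClient`, `normalise`, and `submit_concurrently` are hypothetical, the fake client stands in for something that owns its own connection pool, and callback_url is ignored for brevity.

```python
import asyncio
from typing import Any, List, Tuple


class FakeClient:
    """Hypothetical stand-in for a client that owns its own connection pool."""
    opened = 0

    async def __aenter__(self) -> "FakeClient":
        FakeClient.opened += 1  # a fresh "pool" per context
        return self

    async def __aexit__(self, *exc_info: Any) -> bool:
        return False

    async def submit_batch(self, name: str, items: List[Any]) -> Tuple[str, int]:
        await asyncio.sleep(0)  # simulates the network round-trip
        return name, len(items)


def normalise(spec: Any) -> Tuple[str, List[Any]]:
    """Accept (name, items) tuples or {"name": ..., "items": [...]} dicts."""
    if isinstance(spec, dict):
        return spec["name"], spec["items"]
    name, items = spec
    return name, items


async def submit_concurrently(specs: List[Any], concurrency: int = 10) -> List[Tuple[str, int]]:
    gate = asyncio.Semaphore(concurrency)  # caps in-flight submits

    async def worker(spec: Any) -> Tuple[str, int]:
        name, items = normalise(spec)
        async with gate:
            # Fresh client, so no stale connections from an earlier phase.
            async with FakeClient() as client:
                return await client.submit_batch(name, items)

    # gather preserves input order regardless of completion order.
    return list(await asyncio.gather(*(worker(s) for s in specs)))
```

The sketch opens one client per submit rather than one per long-lived worker; either way, the point is that the submit phase never shares a pool with the prep phase.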

Changed

  • PoolExhausted message now includes pool stats when available, plus the actionable mitigations inline. Example:

    PoolExhausted: SDK connection pool exhausted (request to /v1/async/collections
    never reached the server). Pool state: 100/100 in use, 87 idle / keepalive,
    max_keepalive=100.
    Likely cause: a long prep phase on this client (many earlier GETs/scrapes) is
    holding connections, starving the submit phase.
    Mitigation: (1) for parallel batch submits, use
    client.submit_batches_concurrent(batches, concurrency=N) — each worker uses
    its own fresh client. (2) Or open a fresh AsyncClient just before the submit
    phase. (3) Or raise the pool: AsyncClient(token, http_limits=httpx.Limits(
    max_connections=500, max_keepalive_connections=200)).

    The pool-stats lookup is wrapped in try/except, so a future httpx internal rename can't break the error path — falls back to configured limits only.

Notes / context

  • This is a strict superset of v0.7.0 — no behaviour change for callers who don't use submit_batches_concurrent. The new method is purely additive.

  • The new docs section "When NOT to share an AsyncClient" (Batch API page) covers the fetch-then-submit anti-pattern with the canonical fix inline.

  • All HTTP and DNS in scrapingpros/ goes through httpx's async resolver — no socket.* / urllib.* blocking calls anywhere in the SDK.


0.7.0 — 2026-05-15

Production-first SDK. We removed the path that doesn't scale and surfaced the failure mode that mimicked a server error. If you used scrape_many, your migration is one word: batch_scrape. Same signature, same return shape, server-side scaling, automatic refunds on soft-blocks, resume after a crash.

Why this release

Three things crystallised at once:

  1. scrape_many doesn't scale. Benchmarked against the Collections-backed batch_scrape at production-relevant scale (N=1000 URLs, browser=True) on prod:

    | Method | Wall | Throughput |
    |---|---|---|
    | scrape_many | 930 s | 1.07 URLs/s |
    | batch_scrape | 185 s | 5.39 URLs/s |
    | submit_batch streaming | 215 s | 4.66 URLs/s |

    batch_scrape is 5× faster at this size. scrape_many opens N parallel connections from your machine to the per-request endpoint; under load your local pool saturates and the per-request endpoint isn't built for sustained fan-out. Collections-backed methods send a single request to the queued endpoint and stream results back as they finish.

    Beyond the wall-time difference, the Collections-backed methods are also the path that gets you automatic credit refunds on soft-blocks (the validator detects thin / blocked content server-side and refunds without you having to inspect each response). On scrape_many, soft-blocked content arrives as a 200 with mojibake or empty body and you're billed for it.

  2. Pool-exhaustion errors used to point at the wrong layer. When a client legitimately ran asyncio.gather of many submit_batch calls and hit the SDK's local connection pool ceiling, the failure surfaced as SubmitTimeout pointing at the API endpoint — so users assumed the server was down or slow. The new PoolExhausted exception names the actual cause and tells you to raise the pool ceiling.

  3. SyncClient inside a running event loop is almost always a misuse. It blocks the loop on every call. We now emit a RuntimeWarning when detected so developers see the issue during testing instead of debugging slow async apps in production.
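The event-loop detection behind the third point can be approximated with asyncio.get_running_loop(), which raises RuntimeError when no loop is running. This is a sketch of the general technique; `SketchSyncClient` is a hypothetical stand-in and the SDK's actual check may differ.

```python
import asyncio
import warnings


def in_running_event_loop() -> bool:
    """True when called from code executing inside an asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:  # raised when no loop is running in this thread
        return False
    return True


class SketchSyncClient:
    """Hypothetical stand-in for a blocking client."""

    def __init__(self) -> None:
        if in_running_event_loop():
            warnings.warn(
                "blocking client created inside a running event loop; "
                "every call will block the loop",
                RuntimeWarning,
                stacklevel=2,
            )


async def build_inside_loop() -> SketchSyncClient:
    return SketchSyncClient()  # triggers the warning
```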

Breaking

  • scrape_many removed from both SyncClient and AsyncClient. The method is still defined (so callers don't see AttributeError) but raises RuntimeError with the migration recipe inline. The replacements:

    # Before
    results = client.scrape_many(urls, format="markdown", browser=True)

    # After (drop-in, returns the same shape)
    results = client.batch_scrape(urls, format="markdown", browser=True)

    # Or, streaming with live progress:
    for result in client.submit_batch("daily", urls).iter_results():
        ...

    Users on v0.5.1+ have had a DeprecationWarning on scrape_many calls for three releases. v0.7.0 makes the failure mode loud.

Added

  • PoolExhausted exception. Raised when the SDK's local HTTP connection pool saturates — i.e. the request never left your machine. Distinct from SubmitTimeout (which means the server failed to respond in time). Subclasses ConnectionError so existing except ConnectionError clauses still catch it. The error message points to the fix:

    import asyncio

    import httpx
    from scrapingpros import AsyncClient, PoolExhausted

    try:
        await asyncio.gather(*(client.submit_batch(...) for ...))
    except PoolExhausted:
        # Rebuild the client with a larger pool, then retry.
        client = AsyncClient(token, http_limits=httpx.Limits(
            max_connections=500, max_keepalive_connections=200,
        ))
        # ... retry
  • http_limits= constructor argument on SyncClient and AsyncClient. Pass any httpx.Limits to override the default pool ceiling (200 / 100). No more monkey-patching client._http.

  • RuntimeWarning when SyncClient is instantiated inside a running event loop. Catches the common misuse from AI-generated code and async refactors. Doesn't break anything — it's a warning, not an error.
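Because the new exception subclasses ConnectionError, handlers written before v0.7.0 keep working. The sketch below demonstrates that with a hypothetical stand-in class (`SketchPoolExhausted`), so it runs without the SDK installed.

```python
class SketchPoolExhausted(ConnectionError):
    """Hypothetical stand-in: the request never left the local pool."""


def submit(free_connections: int) -> str:
    if free_connections == 0:
        raise SketchPoolExhausted("connection pool exhausted")
    return "submitted"


def submit_with_legacy_handler(free_connections: int) -> str:
    """A pre-existing `except ConnectionError` still catches the subclass."""
    try:
        return submit(free_connections)
    except ConnectionError as exc:
        return f"retry later: {type(exc).__name__}"


def submit_with_specific_handler(free_connections: int) -> str:
    """List the subclass first when it needs different handling."""
    try:
        return submit(free_connections)
    except SketchPoolExhausted:
        return "raise the pool ceiling"
    except ConnectionError:
        return "retry later"
```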

Changed

  • download() deprecated. Both SyncClient.download(url) and AsyncClient.download(url) emit a DeprecationWarning at runtime. Since v0.6.0, scrape(url, browser=False) returns binary content natively (body_base64, body, save()) and covers the same use case with a richer surface. download() will be removed in a future major release.

    # Before
    import base64

    result = client.download(url)
    data = base64.b64decode(result.content)

    # After (recommended)
    resp = client.scrape(url, browser=False)
    resp.save("file.pdf")  # or: resp.body for bytes
  • Class-level docstrings rewritten on SyncClient and AsyncClient to make the canonical pattern obvious: SyncClient for REPL / one-off scripts, AsyncClient for production work. Both classes call every endpoint — the "sync" / "async" in the API URL refers to response delivery (inline vs queued), not the Python client.

Notes / migration

  • What if I really need to fan out from the client? You almost never do. The Collections API is faster and more reliable above ~200 URLs. Below that, submit_batch adds a small fixed overhead — at N=100 with browser=True, scrape_many was ~3× faster in absolute terms (42 s vs 135 s), but you give up refunds, resume, idempotency, and soft-block detection. For new code, prefer the Collections methods even at small N; you'll never have to revisit the choice when N grows.

  • Benchmark against your own workload to validate sizing decisions — pick N, browser=True/False, and target sites that match your real traffic. The 5× ratio above is for N=1000 with browser=True on a mixed real-site sample; small-N or non-browser numbers will look different.

  • scrape_many callers will see a RuntimeError on the next call after upgrade. The error message contains the migration recipe and a link to these notes; we deliberately did not leave a quiet AttributeError so the failure is actionable.

  • No changes to the wire format. The API endpoints, request/response shapes, and existing client methods are unchanged. This is an SDK-only release.


0.6.0 — 2026-05-14

Binary content support on ScrapeResponse. The API now returns PDFs / images / ZIPs / etc. directly from /scrape with three new fields (content_type, body_base64, body_url); previously the SDK silently discarded them and clients downloading files via /scrape got an empty html. v0.6.0 exposes the fields and adds five helpers modelled on requests.Response so downloading a PDF is a one-liner.

This is a minor bump (0.5.3 → 0.6.0) — zero breaking changes, all additions opt-in.

Added

  • ScrapeResponse.content_type (str | None). Standard HTTP Content-Type of the response (e.g. "text/html; charset=utf-8", "application/pdf", "image/png"). Populated by the API since 2026-05-14; None for legacy responses (treat absence as "text/html" for backward compat).

  • ScrapeResponse.body_base64 (str | None). Base64-encoded raw response body, populated only when the response is binary. Use the body / text / save helpers below to access the content; you almost never need to decode this field yourself.

  • ScrapeResponse.body_url (str | None). Reserved for future blob offload of large binary bodies (>5 MB threshold). Currently always None; the field is typed now for forward compatibility, so clients using download_body() don't break when the server starts populating it.

  • is_binary property — True iff the response is binary. Use this to branch before accessing html (empty for binary) or content (returns None for binary):

    resp = client.scrape("https://investors.example.com/charter.pdf")
    if resp.is_binary:
        resp.save("charter.pdf")
    else:
        print(resp.content)
  • body property — bytes. Mirrors requests.Response.content. Always returns bytes: decodes body_base64 for binary, UTF-8-encodes markdown or html for text. Raises ValueError for offloaded bodies (body_url), pointing you to download_body().

  • text property — str | None. Mirrors requests.Response.text. Returns None for binary so a careless str.find() call fails loudly instead of silently parsing base64.

  • save(path) — write the body to a file, return bytes written. Works for both text (UTF-8 encoded) and binary:

    client.scrape(pdf_url).save("out.pdf")
  • download_body() (async) + download_body_sync() (sync) — fetch body bytes whether they're inline (body_base64) or offloaded (body_url). The sync variant exists so SyncClient users don't need an event loop when the blob-offload path eventually lights up.
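
The accessor semantics listed above can be pictured with a small stand-in class. This is a sketch built only from the descriptions in these notes, not the SDK's implementation: the class name FakeScrapeResponse is hypothetical, treating a response as binary when body_base64 or body_url is set is an assumption, and so is preferring markdown over html for text responses.

```python
import base64
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeScrapeResponse:
    """Hypothetical stand-in mirroring the accessors described above."""

    html: str = ""
    markdown: Optional[str] = None
    content_type: Optional[str] = None
    body_base64: Optional[str] = None
    body_url: Optional[str] = None

    @property
    def is_binary(self) -> bool:
        # Assumption: binary means an inline base64 body or an offloaded URL.
        return self.body_base64 is not None or self.body_url is not None

    @property
    def text(self) -> Optional[str]:
        # None for binary, so careless string operations fail loudly.
        if self.is_binary:
            return None
        return self.markdown or self.html

    @property
    def body(self) -> bytes:
        if self.body_base64 is not None:
            return base64.b64decode(self.body_base64)
        if self.body_url is not None:
            raise ValueError("body is offloaded; use download_body()")
        return (self.markdown or self.html).encode("utf-8")

    def save(self, path: str) -> int:
        # Path.write_bytes returns the number of bytes written.
        return Path(path).write_bytes(self.body)
```

With this shape, the same save() call works for a PDF and for an HTML page, and text stays None for the PDF so that string handling cannot silently run over base64 data.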

Changed

  • content property now returns None for binary responses (previously it returned the empty html string). This is a behaviour change for binary responses only; text responses are unchanged. The previous return value was effectively useless, so this is a clarity improvement rather than a meaningful break.

Notes

  • Backward compat: every existing field is preserved, every existing accessor (html, markdown, guidance, statusCode, content for text) behaves identically. Code that didn't touch binary content sees no change.

  • MethodPOST.content_type vs ScrapeResponse.content_type: both fields are now named content_type. They are semantically distinct (request body encoding "json" / "form", vs. HTTP response Content-Type) and live on different models. Autocomplete will surface both — pick by context.

  • body_url is reserved: the SDK ships download_body() / download_body_sync() now so client code is forward-compatible. When the server starts offloading bodies, no client change is needed.


0.5.3 — 2026-05-04

Consumes eight new server endpoints/contracts. The recovery flow that was partial in v0.5.2 is now end-to-end: a SubmitTimeout followed by a retry no longer duplicates the batch, and find_recent_batch reattaches to the live run in a single round-trip.

Added

  • Idempotency-Key is now sent automatically on submit_batch and submit_batch_lenient. The SDK generates a fresh UUID per call so a network-level retry of the same submit (httpx.ReadTimeout, transient 5xx) returns the same collection_id without creating a duplicate run. Pass idempotency_key=... to control the key yourself; pass it through create_collection if you're using the lower-level method directly.

    Practical impact: after a SubmitTimeout, it is safe to retry. The server replays the original response within 24 h.

  • client.list_runs(cid) — lists every run of a collection. Available on both SyncClient and AsyncClient. Backed by GET /v1/async/collections/{cid}/runs (server-side endpoint added 2026-04-30). Optional status_filter="in_progress" or "completed".

    resp = client.list_runs(cid, status_filter="in_progress")
    for run in resp.items:
        print(run.run_id, run.status, run.created_at)
  • RunListPublic model — the response shape returned by list_runs. Exported from the top-level package.

  • Typed 404s on get_job_result: JobResultPending, JobResultExpired, JobResultLost, and JobNotFound, all inheriting from a new JobResultError (itself inheriting from APIError). The SDK parses the structured error_code the API now returns and raises the appropriate subclass:

    from scrapingpros import JobResultPending, JobResultExpired, JobNotFound
    try:
        r = client.get_job_result(cid, rid, jid)
    except JobResultPending:
        schedule_retry(jid)
    except JobResultExpired:
        requeue(jid)  # > 24 h since completion
    except JobNotFound:
        log_bug(jid)

    Existing except APIError clauses continue to catch them.

  • CollectionPublic.created_at + updated_at — both float | None (Unix epoch seconds, UTC). Lets find_recent_batch(since=...) filter precisely server-side. None on older rows that pre-date the field.

  • RunPublic.created_at — same shape, available on every RunPublic returned by the API.

  • BlockedURL model — describes a URL the API refused to enqueue when creating a collection. Exposed on NewCollectionResponse.blocked_urls. Categorised by reason (private_ip, invalid_protocol, dns_failed, blocked_hostname, invalid_port, malformed_url, blocked).
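
Once a submit returns its list of blocked URLs, a quick grouping by reason usually tells you whether the input is malformed or DNS is just flaky. The sketch below uses a hypothetical FakeBlockedURL stand-in instead of the SDK model; only the url and reason attributes are taken from these notes, and group_by_reason is an illustrative helper, not an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeBlockedURL:
    """Hypothetical stand-in; only url and reason come from the notes."""

    url: str
    reason: str


def group_by_reason(blocked):
    """Map each rejection reason to the list of URLs rejected for it."""
    grouped = defaultdict(list)
    for item in blocked:
        grouped[item.reason].append(item.url)
    return dict(grouped)
```

Anything exposing .url and .reason works with the helper, so the real entries from NewCollectionResponse.blocked_urls should fit without an adapter.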

Changed

  • find_recent_batch(name, since=...) now uses the server-side ?name= and ?since= filters in a single round-trip (was scanning every collection client-side in v0.5.2). It also reattaches to the live run of the recovered collection by calling list_runs(cid, status_filter="in_progress"), so the returned Batch is fully usable — no longer a partial handle with run_id="".

  • submit_batch_lenient rewritten around the new server contract. Reads NewCollectionResponse.blocked_urls directly instead of parsing HTTP 400 detail strings and retrying one URL at a time.

    Signature change: returns tuple[Batch, list[BlockedURL]] (was tuple[Batch, list[dict]] in v0.5.2). The max_drops parameter is gone — no longer a retry loop. Add idempotency_key=... if you want explicit control.

  • get_job_result 404 path — was raising APIError(404, ...), now raises a JobResultError subclass (which still isinstance checks as APIError).

Notes / migration

  • submit_batch_lenient behaves as before, but the second element of its return tuple is a breaking type change: it went from list[dict[str, Any]] to list[BlockedURL]. Replace dropped["url"] / dropped["__rejection_reason__"] accesses with dropped.url / dropped.reason.

  • Idempotency-Key is on by default. Pass idempotency_key=... to control the key yourself (e.g. derive it from your DB row id for a reproducible retry).

  • Older collections / runs may return null for created_at and updated_at. The SDK fields are float | None to tolerate that.
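
One way to derive a reproducible key from a database row id, as suggested above, is a name-based UUID. This is a minimal sketch under stated assumptions: the namespace string and the "batch-row:" prefix are hypothetical, and the only SDK detail relied on is the idempotency_key=... parameter mentioned in these notes.

```python
import uuid

# Hypothetical application namespace. Any fixed UUID works, as long as it
# never changes between the original submit and its retries.
BATCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "batches.example.com")


def idempotency_key_for(row_id: int) -> str:
    """Return the same key every time for the same database row."""
    return str(uuid.uuid5(BATCH_NAMESPACE, f"batch-row:{row_id}"))
```

Passing idempotency_key=idempotency_key_for(row_id) on every attempt for the same row keeps a retry after a crash or restart on the same key, which a freshly generated random UUID would not.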


0.5.2 — 2026-04-30

Resilience and recovery release. Addresses fifteen concerns reported by integrators, focused on three failure modes: batches surviving transient API degradation, recovering from submit_batch timeouts without creating duplicate runs, and giving users the right tools the first time.

Fixed

  • Batch.iter_results() no longer dies on transient HTTP 500 / 502 / 503 / 504. Previously the polling loop only caught ConnectionError; any 5xx response raised APIError and crashed the iterator while the batch kept running server-side. Now polling distinguishes transient errors (5xx, 429, network drops) from real semantic failures (4xx) and rides out API hiccups by retrying on the next tick. The high-water mark is preserved so no progress is lost. Same fix on AsyncBatch.iter_results().

  • client.get_batch(cid, rid) now refreshes counters on construction by default, so batch.total, batch.success_count, batch.pct, etc. are populated immediately instead of staying at 0 until the first iteration tick. Pass refresh=False to skip the round-trip.

  • submit_batch() validates URLs in dict items client-side. Previously a {"url": None} or {"url": ""} would create a collection that fails downstream in a worker with an unhelpful 'NoneType' object has no attribute 'lower'. Now you get a ValueError pointing at the input index before the request is sent.
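
The transient-versus-semantic split described in the first fix can be sketched as a small classifier plus a polling loop. This is an illustration of the idea, not the SDK's polling code: is_transient, poll_until_done, and the (status_code, payload) protocol are all hypothetical, and a real loop would sleep between ticks.

```python
from typing import Any, Callable, Optional, Tuple


def is_transient(status_code: Optional[int]) -> bool:
    """Classify a failed poll. None stands for a network drop (no response)."""
    if status_code is None:
        return True
    return status_code == 429 or 500 <= status_code <= 599


def poll_until_done(
    fetch: Callable[[], Tuple[Optional[int], Any]], max_ticks: int = 10
) -> Any:
    """Call fetch() once per tick until it reports success.

    Hypothetical protocol: fetch returns (status_code, payload), where 200
    carries the final payload. Transient failures are skipped until the next
    tick; any other status is treated as a real failure and raised.
    """
    for _ in range(max_ticks):
        status, payload = fetch()
        if status == 200:
            return payload
        if is_transient(status):
            continue  # ride out the hiccup and retry on the next tick
        raise RuntimeError(f"non-retryable status {status}")
    raise TimeoutError("gave up after max_ticks polls")
```

Keeping the classification in one function makes the policy easy to test on its own and to adjust if the set of retryable statuses changes.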

Added

  • SubmitTimeout exception (subclass of TimeoutError) raised when submit_batch() cannot reach the API in time. Distinct from polling timeouts: the batch was never created, so the message tells you what to do (search for orphans before retrying). Exported from the top-level package.

    from scrapingpros import SyncClient, SubmitTimeout

    try:
        batch = client.submit_batch(name, items)
    except SubmitTimeout:
        orphan = client.find_recent_batch(name=name)
        ...
  • submit_batch(submit_timeout=30.0) — dedicated short timeout for the submit round-trip, separate from the 120s default that covers individual scrape requests. Fail fast during API degradation instead of hanging two minutes.

  • submit_batch(on_submitted=fn) callback — fires inside the SDK call right after the collection (and again after the run) is created server-side, so you can persist (collection_id, run_id) to disk before any code that might crash. fn(collection_id, run_id_or_none) is invoked twice; on the async client it can be a coroutine.

    import json
    from pathlib import Path

    def remember(cid, rid):
        Path("ids.json").write_text(json.dumps({"cid": cid, "rid": rid}))
    batch = client.submit_batch(name, items, on_submitted=remember)
  • client.find_recent_batch(name, since=None): orphan recovery helper. After a SubmitTimeout, looks up a recently created collection by exact name match and returns a Batch handle pointing at it. Use a unique name per submit (e.g. with a UUID suffix) for this to be reliable. The since argument is accepted now for forward compatibility; it can only filter once the server starts returning created_at on collections.

  • client.iter_results(cid, rid) convenience shortcut — equivalent to client.get_batch(cid, rid, refresh=False).iter_results(...). Lets you stream results from a persisted (cid, rid) pair without dealing with the Batch handle. Available on both SyncClient and AsyncClient.

  • Batch.refresh() — public method to force a one-shot refresh of progress counters. Useful for monitoring loops, dashboards, or recovery scripts that want a snapshot without entering an iter_results() loop. Returns self so you can chain. Same on AsyncBatch.refresh().

  • client.submit_batch_lenient(name, items) — variant of submit_batch that drops URLs the API rejects (private IPs from DNS resolution, takedown redirects, etc.) and retries until the batch is accepted. Returns (batch, dropped). Useful when working with a pool of URLs that occasionally has flaky DNS. Sync client only for now.

Changed

  • Retry log demoted from WARNING to INFO. Under load, the SDK can fire dozens of retry log lines per minute (Request POST /v1/sync/scrape returned 500, retrying ...). Previously emitted at WARNING, drowning legitimate WARNINGs in the caller's output. Now at INFO so the default WARNING level silences them; lower the SDK logger to INFO when you need visibility:

    import logging
    logging.getLogger("scrapingpros").setLevel(logging.INFO)

    No SDK code uses print() directly, and the SDK never calls logging.basicConfig() — your logger configuration is fully respected.

Notes

  • Idempotency keys for submit_batch are not yet available server-side. Until they ship, find_recent_batch + a unique name per submit (UUID suffix) is the recommended pattern to avoid duplicates after a SubmitTimeout.

  • A few of the underlying capabilities (a runs listing endpoint, created_at on collections, structured status_filter on /jobs) need server-side work before the SDK can fully close some recovery paths. SDK-side workarounds are in place where possible; the missing pieces are tracked for follow-up.
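
The recommended pattern (a unique name per submit, then an orphan lookup after a timeout) can be sketched as follows. The helpers are hypothetical and take plain callables so the sketch runs without the SDK: FakeSubmitTimeout stands in for SubmitTimeout, submit and find_recent would be thin wrappers around client.submit_batch and client.find_recent_batch, and the lookup returning None when nothing matches is an assumption.

```python
import uuid


class FakeSubmitTimeout(TimeoutError):
    """Hypothetical stand-in for the SDK's SubmitTimeout."""


def unique_batch_name(prefix: str) -> str:
    """Append a random UUID so an exact-name lookup matches only this submit."""
    return f"{prefix}-{uuid.uuid4().hex}"


def submit_or_recover(submit, find_recent, name, items):
    """Submit once; on a submit timeout, look for an orphan before retrying.

    submit(name, items) and find_recent(name) are caller-supplied callables.
    find_recent is assumed to return None when no collection matches.
    """
    try:
        return submit(name, items)
    except FakeSubmitTimeout:
        orphan = find_recent(name)
        if orphan is not None:
            return orphan  # the server created the batch after all
        return submit(name, items)  # nothing was created, so resubmit
```

Generate the name once and reuse it for the lookup and the retry; generating a new name on the retry would defeat the exact-name match.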


0.5.1 — 2026-04-29

Naming and discoverability release. No new functionality on the wire — this version makes the recommended way to scale to many URLs much easier to find, and pushes back on patterns that don't scale.

Why this release exists

Three different mechanisms exist for scraping multiple pages, and the names made it easy to pick the wrong one:

  • scrape() — one URL at a time, blocking.
  • scrape_many() — opens N concurrent HTTP connections from your machine to /v1/sync/scrape. Doesn't scale.
  • submit_batch() / Collections API — sends one request, the server runs the batch with optimised concurrency. Scales to 50,000+ URLs.

Users repeatedly reached for scrape_many() (or thought switching to AsyncScrapingPros would magically improve throughput) when they actually wanted the Collections API. This release marks the misleading paths as deprecated and adds the tools that make the right path obvious.

(The original v0.5.1 release notes pointed at a "Choosing the right method" page that was folded into the Batch API doc's FAQ in v0.7.0.)

Added

  • SyncClient and AsyncClient — preferred names for the existing client classes. Same surface, same behaviour. The new names clarify that the only real difference between the two classes is the local I/O loop, not which API endpoints they can reach (both can call sync and Collections).

    # Old
    from scrapingpros import ScrapingPros, AsyncScrapingPros

    # New — recommended
    from scrapingpros import SyncClient, AsyncClient
  • client.batch_scrape(urls, ...) — convenience wrapper around submit_batch + iter_results that blocks until the batch completes and returns a flat list[ScrapeResponse]. Drop-in replacement for scrape_many when you want server-side scaling without writing the streaming loop.

    # Same shape as scrape_many, but uses the Collections API under the hood
    results = client.batch_scrape([
        {"url": u, "custom_id": product_id, "browser": True}
        for product_id, u in catalog.items()
    ])

    Available on both SyncClient and AsyncClient.

Changed

  • run_and_wait() default timeout is now 3600 seconds (1 hour), up from 300. The old default was timing out legitimate browser/stealth runs that take 20+ minutes. If you relied on the 300 s default to cap runaway runs, pass timeout=300 explicitly. (Applies to both SyncClient.run_and_wait and AsyncClient.run_and_wait.)

Deprecated

All deprecations emit a DeprecationWarning at runtime (one-shot and deduplicated per call site, which is Python's standard behaviour). Behaviour is unchanged; replacements work today. Removal targeted for v1.0.

  • ScrapingPros → use SyncClient instead. Identical class behaviour; the rename clears up the false implication that it only spoke to the sync endpoint.

  • AsyncScrapingPros → use AsyncClient instead.

  • scrape_many() → use batch_scrape() (list return) or submit_batch() + iter_results() (streaming). Server-side parallelism instead of N parallel HTTP connections from your machine.

    # This still works in v0.5.x but emits a DeprecationWarning
    results = client.scrape_many(urls)

    # Prefer one of these instead
    results = client.batch_scrape(urls) # blocking, returns list
    for r in client.submit_batch("name", urls).iter_results():
        ...  # streaming, with progress

Notes

  • Python's default warning filters ignore DeprecationWarning unless it is triggered by code running in __main__, so most production code will not see it. Tests run with strict warning filters (e.g. filterwarnings = error in pytest) will treat it as an error; either migrate to the new names or filter the specific warning.
  • The ScrapingPros and AsyncScrapingPros symbols still resolve normally, are still in __all__, and continue to work identically. Existing code does not need to change immediately.
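
If you need to filter the specific warning while a migration is in flight, the standard warnings module can do it locally. The sketch below uses a hypothetical legacy_scrape_many stand-in, and the warning message text is an assumption; check the message your SDK version actually emits before copying the pattern.

```python
import warnings


def legacy_scrape_many(urls):
    """Hypothetical stand-in for a deprecated SDK method."""
    warnings.warn(
        "scrape_many() is deprecated; use batch_scrape()",  # assumed wording
        DeprecationWarning,
        stacklevel=2,
    )
    return list(urls)


def call_quietly(urls):
    """Call the legacy function with only its own deprecation silenced."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore",
            message=r"scrape_many\(\) is deprecated",
            category=DeprecationWarning,
        )
        return legacy_scrape_many(urls)
```

Matching on the message keeps every other DeprecationWarning visible, which is safer than ignoring the whole category under a strict test configuration.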

0.5.0 — 2026-04-29

Feature release adding form-encoded POST, response body capture, attached-state waits, and richer per-job metadata. No breaking changes; two soft deprecations.

Fixed

  • MethodPOST no longer silently drops content_type. Passing content_type="form" to MethodPOST now correctly sends the body as application/x-www-form-urlencoded. Previously the parameter was discarded by the SDK and the body always went out as JSON, which broke OAuth2 grant_type=client_credentials flows and any other API that requires form-encoded payloads — the server would respond 400 invalid Content-Type and the request would fail with no obvious cause. If your scraper authenticated against an OAuth2 token endpoint and was getting empty results, upgrading to 0.5.0 fixes it.

Added

  • MethodPOST.content_type — choose "json" (default, unchanged behaviour) or "form". Use "form" for OAuth2 client_credentials flows and most legacy form-based APIs.

    from scrapingpros import MethodPOST

    resp = client.scrape(
        "https://api.example.com/v1/oauth2/token",
        http_method=MethodPOST(
            payload={"grant_type": "client_credentials", "scope": "read"},
            content_type="form",
        ),
    )
  • WaitForSelectorAction.state — accepts "visible" (server default), "attached", or "hidden". Pass "attached" to match hidden DOM nodes such as <script id="__NEXT_DATA__"> tags carrying embedded JSON, which the default visible-wait would never resolve and would always time out.

    from scrapingpros import WaitForSelectorAction

    result = client.scrape(url, browser=True, actions=[
        WaitForSelectorAction(
            selector="css:script#__NEXT_DATA__",
            time=8000,
            state="attached",
        ),
    ])
  • NetworkCaptureConfig.url_pattern — glob pattern that asks the server to capture the response body of matching requests, in addition to the usual metadata. Useful for grabbing OAuth / Firebase tokens, GraphQL persistedQuery payloads, or any internal API response without re-running the request yourself.

    Bodies are capped at 64 KB; larger responses come back with body_truncated: true. Body fetch has a 5 s timeout — if it expires, the entry gets a body_error field instead of body (the scrape itself never hangs).

    from scrapingpros import NetworkCaptureConfig

    result = client.scrape(url, browser=True, network_capture=NetworkCaptureConfig(
        resource_types=["xhr", "fetch"],
        url_pattern="*identitytoolkit.googleapis.com*",
    ))

    for entry in result.network_requests or []:
        if "body" in entry:
            token = parse_token(entry["body"])
  • JobExecutionPublic.has_extractable_data (bool | None) — whether the page contained structured data the server could extract (JSON-LD, microdata, OpenGraph, __NEXT_DATA__). Independent of is_success: a 200 page with usable text content can still have no machine-parseable payload.

  • JobExecutionPublic.validator_version (str | None) — version of the HTML Validator that produced is_success, block_reason, protection_stack, and rule_hits for the job. Pin it in integration tests to catch silent classifier upgrades:

    for job in client.iter_run_jobs(col.id, run.run_id):
        assert job.validator_version == "0.1.6"
  • JobExecutionPublic.client_id (str | None) — the client account that owns the job. Useful when working across multiple tenants.
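
For the captured network bodies described under NetworkCaptureConfig.url_pattern, it helps to separate complete bodies from truncated or failed ones before parsing. A minimal sketch, assuming entries are plain dicts as in the example above: the "url" key and the usable_bodies helper are hypothetical, while body, body_truncated, and body_error come from these notes.

```python
from typing import Iterable, Iterator, Optional, Tuple


def usable_bodies(entries: Optional[Iterable[dict]]) -> Iterator[Tuple[str, str]]:
    """Yield (url, body) for entries whose body was captured in full."""
    for entry in entries or []:
        if "body_error" in entry:
            continue  # the body fetch failed or hit the 5 s timeout
        if entry.get("body_truncated"):
            continue  # response exceeded the 64 KB cap, so the body is partial
        if "body" in entry:
            # Assumption: each entry carries the request URL under "url".
            yield entry.get("url", ""), entry["body"]
```

Skipping truncated bodies avoids feeding half a JSON document to a parser; if partial content is still useful to you, handle that case separately instead of dropping it.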

Deprecated

Both deprecations are docstring-only — existing code keeps working with no runtime warning. They're flagged so new code can avoid them and a future major release can remove them cleanly.

  • ScrapeRequest.browser_type — the API now picks the right engine per domain via internal routing. New code should choose only between browser=True (5 credits, full rendering) and browser=False (1 credit, fast path) and let the server handle the rest. Existing code that passes "light" / "heavy" / "stealth" still works.

  • ScrapeResponse.potentiallyBlockedByCaptcha — prefer response.guidance.success for the canonical verdict. guidance also tells you why a request failed and what to try next (error_type, error_provider, next_steps, suggested_request), which the legacy boolean cannot.

    # Old
    if resp.potentiallyBlockedByCaptcha:
        retry()

    # New
    if not resp.guidance.success:
        print(resp.guidance.error_type, resp.guidance.next_steps)
        retry_with(**resp.guidance.suggested_request)

Removed

Nothing removed in 0.5.0. The deprecations above will be candidates for removal in a future major release.


0.4.3 — 2026-04-24

Added

  • JobExecutionPublic.is_success (bool | None) — server's authoritative verdict for whether a job produced usable content. Catches soft-blocks (Google CAPTCHA pages with 200 + large body, Amazon "Robot Check") that a naive status_code check misses.
  • RunPublic.success_criterion — exposes the active success policy (version, rules) so you can pin it in tests.

Changed

  • Batch.iter_results() now honours the server is_success verdict internally — result.guidance.success reflects it without extra effort.

0.4.2 — 2026-04-24

Fixed

  • Batch polling no longer hangs on transient ConnectionError during worker restarts. The SDK now retries with backoff instead of raising.
  • since_completed_at polling correctly resumes after partial failures, avoiding duplicate result delivery.
  • Guidance fallback for jobs created before the server-side guidance rollout: the SDK now reconstructs basic guidance client-side so result.guidance.success is always populated.

0.4.1 — 2026-04-23

Added

  • Cursor-based pagination for get_run_jobs() and iter_run_jobs() with server-side filters (status_filter, since_completed_at). Scales cleanly to runs with 50,000+ jobs without timing out.
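
Cursor-based pagination in general follows the loop below: request a page, yield its items, and continue from the returned cursor until none is left. This is a generic sketch, not the SDK's wire format; fetch_page and the (items, next_cursor) page shape are hypothetical. In practice iter_run_jobs() runs this loop for you.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def iter_pages(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every item across pages, following cursors until exhausted.

    fetch_page(cursor) is a hypothetical callable that returns
    (items, next_cursor), with next_cursor set to None on the last page.
    """
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if cursor is None:
            return
```

Because each request carries only a cursor, memory use stays flat regardless of how many jobs the run contains.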

0.4.0 — 2026-04-23

Added

  • Batch API — the headline feature for production-scale scraping:

    • client.submit_batch(name, requests) — submit any number of URLs at once.
    • batch.iter_results() — stream results as workers finish them, with progress (pct, eta_seconds, success_count, failed_count).
    • Per-job callbacks, automatic resume, configurable timeouts.

    Submit 50,000 URLs, walk away, come back to handled results.


0.3.0 — 2026-04-23

Added

  • ScrapeRequest.custom_id — round-trip a string through the API to map results back to your database without depending on order. Echoed in ScrapeResponse.custom_id and JobExecutionPublic.custom_id.
  • MethodPOST.url — POST to a different endpoint than the navigation target. Useful for sites that set cookies on one URL but expose data on a separate API/GraphQL endpoint.
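
Mapping results back to your own records with custom_id comes down to building an index keyed on the echoed value. A minimal sketch, using SimpleNamespace objects as hypothetical stand-ins for responses; only the custom_id attribute is taken from these notes, and index_by_custom_id is an illustrative helper.

```python
from types import SimpleNamespace


def index_by_custom_id(results):
    """Build {custom_id: result} from results that arrive in any order."""
    return {r.custom_id: r for r in results if r.custom_id is not None}


# Stand-in results, deliberately out of submission order.
results = [
    SimpleNamespace(custom_id="sku-2", html="<p>second</p>"),
    SimpleNamespace(custom_id="sku-1", html="<p>first</p>"),
]
by_id = index_by_custom_id(results)
```

Lookups such as by_id["sku-1"] then work no matter which job finished first, so you never depend on result order.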

0.2.4 — 2026-04-10

Added

  • scrape_many() extended with all scrape() parameters and a heterogeneous mode (each URL can carry its own per-request configuration).

0.2.3 — 2026-04-10

Added

  • browser_type="stealth" mode for hardened anti-bot sites.
  • block_resources (image/font/media/etc.) and block_requests (URL substring blocklist) to speed up browser scrapes by stripping unnecessary resources and trackers.

Changed

  • Default browser_type is now "light" — significantly more concurrent throughput than "heavy" for the vast majority of sites.

0.2.2 — 2026-04-09

Added

  • ScrapeGuidance on every response. Tells you why a scrape failed and what to do next, in a structured form (success, error_type, error_provider, next_steps, suggested_request, stop_reason).
  • Multi-mode viability testing — try several scraping strategies against a URL in one call to find what works before building a full pipeline.

Older versions

For releases before 0.2.2, see the PyPI release history.